GitHub link: https://github.com/TerryTian21/JSC370-Final-Project
A report from FRED (Federal Reserve Economic Data, maintained by the Federal Reserve Bank of St. Louis) on labour market conditions highlighted a drastic change in software engineer job postings within the past 5 years. With the index set to 100 on Feb 1, 2020, the number of postings rose rapidly, peaking in early 2022 (Index = 240). Just as rapidly, postings then fell to a low in late 2023. With numerous tech unicorns announcing layoffs, the tech bubble appears to have burst. This paper evaluates the software/data engineering market in 2021 and 2023, showing the differences in available roles, postings by location, and employee skill-set requirements.
Figure 1: Line chart of Indeed job postings, indexed to a baseline of Feb 1, 2020. The chart is seasonally adjusted based on historic patterns from 2017-2019. Each series, including the national trend, occupational sectors, and sub-national geographies, is seasonally adjusted separately.
The two main questions of interest are as follows:
The goal of this paper is to provide clarity into why so many engineers have been struggling to find employment opportunities in North America. Three datasets are used to answer the above questions: two Kaggle datasets and one dataset found on GitHub.
The first Kaggle dataset, titled Software Engineering Jobs Dataset, was compiled by Yazeed Fares. It contains 9380 observations and 8 features and was collected by scraping LinkedIn Jobs. Although the scraping was performed on Dec. 25, 2023, not all jobs were posted on that specific date, since LinkedIn retains job postings for up to 6 months. For the purpose of this exploration, we treat this as a reasonable sample of job postings in 2023.
The second dataset, uploaded by Arsh Koneru and Zoey Yu Zou, contains a comprehensive aggregation of LinkedIn job postings in 2023/2024. It consists of 11 .csv files that were initially stored as tables in a SQL database. However, we are only interested in posting metadata, so information on companies, industries, and benefits is disregarded. The primary purpose of this dataset is to supplement salary information for software engineering job postings (not contained in dataset 3).
The third dataset was developed by Mike Lawrence, a Machine Learning Engineer at Google. It contains 8261 observations and 13 features. Like the first dataset, it was built from scraped LinkedIn postings, in this case collected in October 2021.
Since the datasets differ in their dimensions, the first step is to subset each one to a matching set of features for the purpose of comparison. The variables of interest are listed below. After subsetting the data, all rows containing NA values are removed.
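The subsetting step can be sketched in R as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the two small data frames, their column names, and the helper `prep()` are hypothetical stand-ins for the real 2021 and 2023 datasets.

```r
library(dplyr)

# Hypothetical miniature stand-ins for the 2021 and 2023 datasets;
# the real files have more rows and possibly different column names.
jobs_2021 <- data.frame(
  Company  = c("Apple", "Uber", NA),
  Title    = c("Software Engineer", "Data Scientist", "QA Engineer"),
  Location = c("Cupertino, CA", "San Francisco, CA", "Austin, TX"),
  Extra21  = c(1, 2, 3),            # column that exists in only one dataset
  stringsAsFactors = FALSE
)
jobs_2023 <- data.frame(
  Company  = c("Northrop Grumman", "Jobs for Humanity"),
  Title    = c("Embedded Engineer", NA),
  Location = c("Baltimore, MD", "New York, NY"),
  Extra23  = c("a", "b"),
  stringsAsFactors = FALSE
)

# Keep only the features that both datasets share.
shared <- intersect(names(jobs_2021), names(jobs_2023))

# Subset to the shared columns, drop rows with any NA, and tag the year.
prep <- function(df, year) {
  df %>%
    select(all_of(shared)) %>%
    na.omit() %>%
    mutate(Year = year)
}

# Stack both years and store Year as a factor, as in the variable table.
jobs <- bind_rows(prep(jobs_2021, 2021), prep(jobs_2023, 2023)) %>%
  mutate(Year = factor(Year))
```

Converting `Year` to a factor only after the two years are stacked avoids having to reconcile different factor levels while binding rows.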
| Variables | Type | Description |
|---|---|---|
| Company | character | Name of Company |
| Description | character | Description of job including but not limited to company overview, requirements, skillset |
| Title | character | Name of position |
| Location | character | Location of Job |
| Seniority | character | Classification of role based on experience, technical expertise, leadership responsibilities |
| Year | factor | Year the Job was Posted |
Titles
Due to the structure of job titles, additional data wrangling is
required to get proper categorization. For example, variation between
postings can result in different titles representing the same type
of position (e.g. Sr. Software Engineer vs. Senior Software Engineer).
This would affect group_by() operations, resulting in many
more categories than necessary. Thus, custom titles are defined based
on keyword matches. Using the case_when() function, titles
are classified from specific to generic, so that the first matching
pattern wins. A title like “Front-end Software Engineer” gets
classified as Front End Engineer rather than Software Engineer.
| Title | Pattern |
|---|---|
| Back End Engineer | Back-end, backend |
| Cloud Engineer | Cloud |
| Data Engineer | Data |
| Data Scientist | Data Scientist |
| DevOps Engineer | Devops |
| Embedded Systems Engineer | Embedded, System |
| Front End Engineer | Front-end, front end, frontend |
| Full Stack Engineer | Full stack, full-stack |
| Machine Learning Engineer | Machine Learning, AI, Artificial Intelligence |
| Mobile Software Engineer | Mobile, iOS, Android |
| Other | .* |
| QA Engineer | Test, Quality, QA |
| Research Engineer | Research, Scientist |
| Security Engineer | Security, Cyber |
| Site Reliability Engineer | Site Reliability, site-reliability |
| Software Engineer | Software |
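
The specific-to-generic ordering matters because case_when() returns the result of the first condition that matches. The sketch below illustrates the idea with a subset of the patterns from the table. The function name `classify_title()` and the exact regular expressions are assumptions for illustration, and the table above is listed alphabetically rather than in evaluation order.

```r
library(dplyr)

# Hypothetical helper: case-insensitive keyword match.
has <- function(pattern, x) grepl(pattern, x, ignore.case = TRUE)

# Classify titles from specific to generic; the first matching rule wins.
# Only a subset of the patterns in the table is shown here.
classify_title <- function(title) {
  case_when(
    has("front[- ]?end", title)  ~ "Front End Engineer",
    has("back[- ]?end", title)   ~ "Back End Engineer",
    has("full[- ]?stack", title) ~ "Full Stack Engineer",
    has("data scientist", title) ~ "Data Scientist",   # before the generic "data"
    has("data", title)           ~ "Data Engineer",
    has("software", title)       ~ "Software Engineer", # generic, checked late
    TRUE                         ~ "Other"              # catch-all, always last
  )
}

titles <- c(
  "Sr. Software Engineer",
  "Senior Software Engineer",
  "Front-end Software Engineer",
  "Data Scientist II",
  "Data Platform Engineer",
  "Barista"
)
classified <- classify_title(titles)
```

With this ordering, both "Sr. Software Engineer" and "Senior Software Engineer" fall into the same Software Engineer group, while "Front-end Software Engineer" is caught by the more specific front-end rule before the generic software rule is reached.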
Seniority
As with titles, seniority levels are inconsistent between the two datasets. The 2021 dataset has 8 levels of seniority while the 2023 dataset only contains 2 classifications. To keep the classifications consistent across both datasets, custom Seniority levels are defined based on keywords in the title (e.g. Staff ~ Staff Level).
| Seniority | Pattern |
|---|---|
| Principal | Principal |
| Staff | Staff |
| Lead | Lead |
| Senior | Sr., Sr, Senior, III |
| Founding | Founding |
| Manager | Manager |
| Junior | Entry Level, Junior, Entry-Level, Graduate, Jr., II, Jr, I |
| Junior | Entry level, Associate |
| Senior | Mid-Senior level |
| None Specified | .* |
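
Short tokens such as "I", "II", or "Sr" can match inside unrelated words (for example the "i" in "iOS") unless the pattern is anchored to word boundaries. The sketch below shows one way to handle this for the title-based keywords. The function name `classify_seniority()` and the regular expressions are assumptions for illustration, and the rows in the table that correspond to values of an existing seniority field (e.g. Associate, Mid-Senior level) are not covered.

```r
library(dplyr)

# Classify seniority from title keywords, following the order of the table.
# \\b marks a word boundary, so "i" matches "Engineer I" but not "iOS".
classify_seniority <- function(title) {
  has <- function(pattern) {
    grepl(pattern, title, ignore.case = TRUE, perl = TRUE)
  }
  case_when(
    has("\\bprincipal\\b")                                  ~ "Principal",
    has("\\bstaff\\b")                                      ~ "Staff",
    has("\\blead\\b")                                       ~ "Lead",
    has("\\b(sr|senior|iii)\\b")                            ~ "Senior",
    has("\\bfounding\\b")                                   ~ "Founding",
    has("\\bmanager\\b")                                    ~ "Manager",
    has("\\b(entry[- ]level|junior|graduate|jr|ii|i)\\b")   ~ "Junior",
    TRUE                                                    ~ "None Specified"
  )
}

seniority <- classify_seniority(c(
  "Sr. Software Engineer",
  "Software Engineer III",
  "Software Engineer II",
  "iOS Engineer",
  "Staff Data Engineer"
))
```

Because "III" is tested in the Senior rule before "II" and "I" are tested in the Junior rule, a level-three title is not misclassified as junior.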
GeoData
Additional wrangling was required to plot Postings Count ~
Location. The provided location data only contains the posting location
formatted as City, State. However, graphing libraries
(ggmap) require latitude and longitude values to plot
location data. After some experimentation, the most effective
visualization used State-level groupings to plot posting data. Step
1 was to use regex to extract the state abbreviation from each location
string. For the data which contained State information, the Google
Geocoding API was then used to obtain coordinates for each state. Because
some postings were located outside the United States and some
location strings could not be parsed, it was not possible to extract a
State value for every location. 7258 postings from 2021 and 6948
postings from 2023 remained available for plotting.
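The regex step can be sketched as follows. This is an illustrative approximation rather than the project's actual code: the sample location strings and the pattern are assumptions, validation relies on base R's built-in `state.abb` vector (50 states, without DC), and the geocoding call itself is not reproduced.

```r
# Hypothetical location strings in the "City, State" format.
locations <- c(
  "Cupertino, CA",
  "Austin, TX",
  "New York, NY",
  "Toronto, Ontario, Canada",
  "Remote",
  "San Jose, CA"
)

# Capture a trailing two-letter uppercase code that follows a comma.
pattern <- ",\\s*([A-Z]{2})\\s*$"
state <- ifelse(
  grepl(pattern, locations),
  sub(paste0(".*", pattern), "\\1", locations),
  NA_character_
)

# Discard anything that is not a valid U.S. state abbreviation.
state[!(state %in% state.abb)] <- NA

# Count postings per state; table() drops the NA entries by default.
counts <- as.data.frame(table(state), stringsAsFactors = FALSE)
```

Locations that are outside the United States or cannot be parsed end up as NA and are excluded from the per-state counts, which mirrors why fewer postings remain available for plotting than exist in the full datasets.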
The following figures represent EDA on our variables of interest. All variables are either factors or textual, hence visualizations are limited to bar charts listing the (top-n) counts grouped by each feature.
Figure 5 compares the number of postings in 2021 with the number in 2023. Note that these datasets are not exhaustive of all postings in 2021/2023, so the counts shouldn’t be taken as evidence contradicting the hypothesis. They reflect the number of postings available on LinkedIn at the time of collection. It is very possible that the scraper missed postings, or that volumes were lower or higher at the particular point in time when the scraper aggregated the dataset.
Figure 5: Comparison of number of postings in 2021 and 2023.
‘Software Engineer’ was the most popular role in both years, but the key difference is a lack of specificity in the 2023 roles. The 2021 dataset has over 1000 occurrences of Machine Learning Engineer (MLE), Site Reliability Engineer and Data Scientist postings, while the second most frequent title in the 2023 dataset is Embedded Systems Engineer with 700 postings. If we account for the possibility that the 2023 scrape did not target Data Science related roles, the difference between the two years is more reasonable.
Figure 6: Comparison of top 10 posting titles in 2021 vs 2023.
One noticeable difference, however, is the desired seniority level. In 2021, entry-level/junior engineer roles were the most frequent seniority, with 2687 postings. The 2023 dataset, however, saw a large shift towards senior roles: the vast majority of postings were for Senior Engineer (4183), and postings for Staff, Principal and Lead Engineer roles also increased.
Figure 7: Comparison of seniority counts for postings in 2021 vs 2023
Although it is possibly attributable to the timing of data collection, companies posting in 2021 are more traditional “big-tech” while most postings from 2023 are scattered. In 2021, Apple posted 600 openings, followed by Microsoft, Uber, and Salesforce with ~100 postings each. 2023 saw a different mix of companies, with a lack of postings from popular SaaS companies and tech giants. In fact, the largest number of postings in 2023 comes from Jobs for Humanity, a platform for “Connecting historically under represented talent to welcoming employers across the globe”. This is not anomalous. According to Layoffs.fyi, 1,036 tech companies laid off a total of 238,397 employees in the first nine months of 2023. Therefore, we would expect to see more postings come from niche sectors like U.S. Defence (Northrop Grumman) and recruiting agencies (Recruiting from Scratch & IP Recruiter Group).
Figure 8: Comparison of company counts in 2021 vs 2023.
Figure 9 represents the geographic locations of job postings. The size of each bubble indicates the number of postings in the given state relative to the other bubbles on the map. There isn’t a significant difference between the two years; both see the largest number of postings in California, Texas and New York, the “tech hubs” of the U.S.
Figure 9: Maps of the United States showing relative posting counts by state.
From an initial breakdown of datasets and variables, it was discovered that 2021 and 2023 saw differences in the metadata associated with postings. Location counts were the only consistent factor between the two years, while the Titles, Seniority and Companies data all support the argument that job-seekers faced increased difficulty in 2023. Many popular employers did not post opportunities for new grads or entry-level candidates and instead sought to fill more senior or leadership positions.
The next step of the project will tackle NLP and prediction models. Job skill sets, years of experience, and sentiment can be extracted and compared across the two years. An additional dataset (dataset 2) containing salaries of SWE jobs in 2023 will be introduced to compare wages. A numeric feature allows for further exploration of variable relationships such as Title~Salary, Location~Salary, and Company~Salary. MLR, GLMM, and boosting models will then be trained on the new data to answer question 2 of our hypothesis: salary prediction.